Abstract

This study presents the process of developing a reading test and establishing its reliability
and validity through an integrative approach, since conventional reliability and validity
measures reveal the difficulty of a reading test only superficially. In this respect, analysing
the vocabulary frequency of the test is regarded as a more suitable way of measuring validity.
A study was conducted at Dokuz Eylül University and Canakkale Onsekiz Mart University with
three colleagues and 100 undergraduate students to establish the validity and reliability, along
with the readability and vocabulary frequency, of a 32-item reading test developed by the
researcher. Such detailed assessment is highly recommended for researchers who need to
prepare pre- and post-tests which differ from each other.

Keywords: assessing reading, reliability, validity, multiple choice, item analysis, item 
difficulty 


It may be helpful to begin by distinguishing among three frequently confused terms:
‘assessment’, ‘evaluation’, and ‘testing’. As identified by Noda (2003), assessment involves
administering examinations to learn about students’ performance along with observing them
during classroom activities; evaluation, however, has nothing to do with formal examinations,
since it deals only with students’ performance during classroom activities. Testing, on the
other hand, requires administering specifically prepared examinations and is not concerned
with students’ performance in the activities. Fry (1977a) groups comprehension questions
into two broad categories: objective and subjective. The former correspond to Pearson and
Johnson’s (1978) textually explicit questions and the latter to their textually implicit ones.
Thus, an objective or textually explicit question provides both the information about the
question and the correct answer, whereas a subjective or textually implicit question yields
the correct answer only by combining a set of related sentences.

It is worth remembering that readers’ comprehension of a text cannot be assessed directly,
since reading comprehension “is totally unobservable” and therefore requires analysing
‘behaviour’ (H. D. Brown, 2001, p. 315). Such analysis depends on several actions such as
doing, choosing, transferring, answering, condensing, extending, duplicating, modelling, and
conversing. According to H. D. Brown, these actions can be observed in acting physically,
selecting among options, summarizing the text, responding to comprehension questions,
outlining, adding an end to a story, translating into L1, following instructions to assemble a
toy, and taking part in a conversation (p. 316).

Alderson (2000) concludes that although reading is regarded as a process, it is quite
common to assess readers’ comprehension with reference to product rather than process, as
the product is much easier to investigate. In this respect, Alderson lists the most common
techniques in testing reading as gap-filling, cloze, multiple-choice, summary,
dichotomous-item, editing, question-answer, matching, and ordering tests (see Alderson, 2000
and Razi, 2005 & 2007 for a detailed account of these test types).

Evaluating test quality 

To ensure the reliability and validity of reading tests constructed by integrating the
above-mentioned techniques, testers refer to a number of analyses, which are explained
below.

Reliability 

Noda (2003) describes reliability as a crucial element of standardized testing and
points out that test-takers receive almost the same mark when a reliable test is administered
to them multiple times. This implies that if a reading test is reliable, the tester can be sure
that the test is consistent and that test-takers perform almost the same each time it is
administered. Noda highlights that group performance is another criterion that needs to be
taken into consideration when dealing with reliability. If a group of test-takers performs much
better or much worse on a test than on previous similar tests, then such a test cannot be
regarded as reliable.

The most common ways of assessing reliability are measuring ‘stability or test-retest’,
‘alternate form’ (Kaplan & Saccuzzo, 2001), ‘internal consistency - Alpha’ (Aiken, 2003),
and ‘interrater reliability or interrater objectivity’ (Goodwin, 2001). First, to measure the
stability of a test, the tester administers the same test twice, with an interval of around two
weeks, and calculates the correlation between the two sets of scores, in which reliability is
reflected. Secondly, by producing two versions of the same test in which the items differ only
very slightly, the tester can calculate reliability from the correlation between the two
versions. Thirdly, internal consistency is regarded as another crucial element of reliability.
Such consistency presumes that a test-taker’s performance is similar on items which are
similar to each other. Fourthly, interrater reliability reveals the consistency of two or more
raters’ scores on the same performance.

The marking procedure needs to be quite objective to provide reliability, as reliable
tests should receive almost the same marks from different markers (S. Brown, 1994). To
provide reliability, testers are also required to use test techniques which are familiar to the
test-takers; otherwise, failure may occur as a result of unfamiliarity with the question types,
which results in an unreliable test. Noda (2003) does not approve of administering a single
long-lasting test at the end of a course as it decreases the reliability of the test; instead she
recommends daily evaluations of the readers for reliable results. S. Brown also calls attention
to a risky way of increasing the reliability of tests. She indicates that testers restrict their
questions to objectively marked items such as multiple-choice questions, which in turn
undermines the test’s validity.

Validity

A test can be regarded as valid if it measures what it is expected to measure in an
efficient way (Crocker & Algina, 1986). The most common types of validity evidence are
‘face’, ‘content’, ‘criterion’, ‘construct’, ‘discriminant’, and ‘generalizability’ (Carducci,
2009). Face validity compares the test with what it is supposed to be assessing in terms of
its appearance, whereas content validity questions the content of the test and compares its
appropriateness with the instructional objectives. Moreover, criterion validity compares the
scores of the test with those of an external criterion, while construct validity aims to match a
theoretical concept with the test by following three steps: specifying theoretical relations,
examining empirical relations, and then interpreting them (Carmines & Zeller, 1991).
Discriminant validity ensures that the test is not excessively related to other instruments
(Campbell & Fiske, 1959), and generalizability indicates how appropriate the test is for
test-takers in a variety of settings.

Validity is considered more important than reliability, as a reliable test may not be
valid. For example, a reliable reading test which consists of gap-filling questions on
grammatical items cannot be regarded as valid for assessing reading comprehension. Noda
(2003) notes that the texts and the tasks in the test are the factors which determine the
validity of the test, and she considers independence of modalities an important element,
which implies that testers need to isolate the tested language skill from the others.
Unfortunately, a considerable number of reading professionals prefer to integrate the other
language skills into reading tests, as it is quite common to encounter a text followed by
summary questions. In such cases a crucial question arises: “What is the aim of the tester?”
If the answer is testing reading comprehension, then is testing readers’ comprehension
through the productive skill of writing an effective way of doing so? Such tests therefore
cannot be considered valid.

Standard error 

Readers are basically categorised as good and poor; it is also possible to add a third
group, namely mediocre readers. Good readers are expected to achieve higher results whereas
poor ones are expected to achieve lower results. Mediocre readers, however, are expected to
survive if they are given a valid and reliable test. In this respect, standard error identifies
their chance of survival, in other words of being successful in the test. Noda (2003) considers
the administration of a single long-lasting test at the end of a course an ill-advised attempt,
as standard error cannot be taken into consideration in such a single test.

Readability analysis 

Readability scores aim to measure the linguistic complexity of texts (Alderson, 2000),
and to this end a number of readability formulas have been developed to assess a text’s
difficulty by treating texts as products (Wallace, 1992) with reference to the lengths of the
words and sentences in them (Fry, 1977b). For example, Fry’s formula works on 100-word
samples taken from the beginning, middle, and end of the text, and calculates difficulty in
positive correlation with word and sentence lengths. There are also formulas which aim at
estimating lexical load by identifying the frequencies of the words that appear in a text or by
examining their lengths. Another approach to determining the readability of a text is
investigating the sentence lengths in it. However, Alderson regards this as controversial,
since adding new words to a sentence may make it easier to comprehend. Alderson concludes
that it is almost impossible to identify the difficulty of a text absolutely; therefore he
recommends the use of authentic texts appropriate to the aim.

However, Chastain (1988) re-examines the validity of readability analysis and reveals that it
would be unwise to blame linguistic complexity alone for reading comprehension problems,
as reading is regarded as an interactive process in which readers’ schemata and their interest
in reading the text are considered major contributors to the understanding of texts. Wallace
(1992) argues that reduced clauses also need to be considered, since they shorten sentences
while making them more difficult. Alderson (2000) also objects to the use of readability
analysis, as he regards it as a product approach to reading with two limitations: variation in
the product and in the method used to measure the product.

Corpus linguistics 

Although a corpus may be defined as any collection of more than one text, in modern
linguistics the four characteristics of ‘sampling and representativeness’, ‘finite size’,
‘machine-readable form’, and ‘a standard reference’ should also be incorporated in corpus
studies (McEnery & Wilson, 1996).

Conrad (2005, p. 394) notes that a corpus consists of both written texts and
transcriptions of speech. She calls attention to the importance of the authenticity of the
materials in the corpus, as it is a “collection of naturally occurring texts that is stored in
electronic form” rather than materials prepared for teaching language. Conrad maintains that
technological advances have made possible large-scale corpora consisting of hundreds of
millions of words, compared to the one-million-word corpora of the 1970s. Such advances
encourage dictionary writers to give the frequency of words. It was Firth (1957) who first
introduced the term collocation; however, his proposal was realized through the advances in
corpus linguistics. Such advances undoubtedly helped Lewis (1993) to give birth to the
lexical approach, where the emphasis is on building lexical units. Richards and Rodgers
(2001) indicate that apart from collocations, binomials, trinomials, idioms, similes,
connectives, and conversational gambits also appear in language.

Bias and testing reading 

As discussed earlier, any quality test is required to be valid and reliable, with an
acceptable standard error value. Besides, bias is something to be removed from a quality test
(Murphy, 1994), since it prevents testers from evaluating test-takers’ responses fairly. In
order to identify whether an anomalous-looking question is biased or not, Murphy
recommends that testers examine test-takers’ responses to determine any ‘atypical’
performances. To make the concept more comprehensible, Murphy gives an example from
Hannon and McNally (1986), who examine a biased reading question as presented below.

An example from the reading text: 

The man was very late and just managed to jump ... the bus as it was pulling away
from the stop.

1 at 

2 up 

3 on 

4 by 

(Murphy, 1994, p. 297) 

Over half of the test-takers failed to choose the correct answer to the above
question because of their insufficient knowledge of colloquial English rather than any
inability in reading comprehension. An interesting conclusion on biased results comes
from Capel, Leask, and Turner (1995), who indicate that multiple-choice questions, as in
Hannon and McNally’s (1986) example, seem to favour males over females.

The study 

Testers generally aim to establish the reliability and validity of their tests by
administering the analyses discussed above under the headings of reliability and validity.
However, if the aim is testing reading, then testers also refer to various readability
analyses to identify the difficulty of the texts in their tests. Readability formulas have long
been criticised since they merely take word and sentence lengths into consideration
(Wallace, 1992). Apart from readability analyses, then, there arises a need to investigate
other aspects of the text. In this respect, corpus linguistics studies may assist reading
testers.

Although reliability and validity analyses are regarded as standard procedures,
calculating word frequency is not usually taken into consideration. Therefore, the present
study aims to establish the validity and reliability, along with the readability and vocabulary
frequency, of a reading test developed by the researcher, with a view to producing a more
reliable and valid reading test. It thus sought to answer whether it was possible to evaluate
reading tests in terms of vocabulary frequency and to integrate this with the other reliability
and validity measures.

The student participants of the study were instructed to answer questions in 90 
minutes. They were also reminded that their wrong answers did not have any impact on their 
score from that test. Besides, they were not allowed to use dictionaries during the test. 

Setting 

The validity study was conducted in the ELT Departments of Dokuz Eylül University and
Canakkale Onsekiz Mart University with three colleagues, whereas the reliability study was
conducted in the ELT Department of Canakkale Onsekiz Mart University with 100
undergraduate students during the fall semester of the 2008-2009 academic year. The ELT
Department was suitable for this study because of the high English language proficiency of
the participants.

Participants 

The participants were 100 students from preparatory, freshman, sophomore, junior, and
senior classes, with an average age of 20. All the participants were considered advanced
Turkish learners of English, as they had to take the Foreign Language Examination (YDS)
placement test, which is administered once a year by the Higher Education Council Student
Selection and Placement Centre of Turkey (OSYM), in order to study at the ELT Department.
Apart from the YDS, in order to enrol in first-year courses, the students were required to take
an exemption examination on registration at the department, which tested their proficiency in
English by dealing with all language skills along with grammar and vocabulary.

As the department of ELT is a female-dominated one, a vast majority of the participants
were female. The gender distribution of the participants in the study is shown in Table 1.

Table 1

Gender Distribution of Participants

Classes        Female   Male   Class Total
Preparatory      16       4        20
Freshman         14       6        20
Sophomore        15       5        20
Junior           15       5        20
Senior           14       6        20
Total            74      26       100


Moreover, oral permission had previously been sought from the students to use their 
test results for research purposes. On this occasion, they were reminded that the data to be 
collected was for research purposes only; it would be kept confidential, and would have no 
bearing on assessment of their courses. 

Instrument 

A four-section, 32-item reading test was developed by the researcher to test reading 
comprehension. There were four-option multiple choice questions in the first, third, and 
fourth sections of the test. Such questions were a combination of Pearson and Johnson’s 
(1978) textually explicit, textually implicit, and scriptally implicit questions along with Fry’s 
(1977a) objective and subjective questions. The second section of the test presented paragraph 
matching questions. As proposed by Alderson (2000), there were more options in the 
matching section than the task demanded. All the texts in the test were taken from real life 
reading materials and adjusted for the test. All the questions in the test were prepared by the 
researcher. The reading test was very similar to University of Cambridge Local Examinations 
Syndicate (UCLES) Examinations in English as a Foreign Language Certificate of 
Proficiency in English (CPE) Reading Paper, apart from the replacement of a section. 

Findings and discussion 

Validity of the reading test 

To avoid producing test items which can be answered without reading the text, as
proposed by Hadley (2003), the multiple-choice questions were answered without reading the
texts by an Associate Professor at the ELT Department of Dokuz Eylül University. Then, to
provide other validity measures of the reading test, the questions and the texts in the reading
test were evaluated by the same colleague in terms of content, face, and criterion-related
validity. Since the questions in the test focused on a variety of aspects of reading
comprehension such as ‘implication’, ‘opinion’, ‘main idea’, ‘detail’, ‘attitude’, ‘cohesion’,
‘coherence’, ‘text structure’, ‘global meaning’, ‘comparison’, and ‘reference’ in either
multiple-choice or multiple-matching format, the test was regarded as valid in terms of its
content. Moreover, as the participants of the study were familiar with such texts and question
types, it was also regarded as valid in terms of face validity. As the reading test was quite
similar to the UCLES CPE Reading Paper, apart from the replacement of a section in
accordance with the aim of the researcher, it was regarded as valid in terms of
criterion-related validity.

The reading test was also evaluated by two native English-speaking colleagues at
Canakkale Onsekiz Mart University, one of whom was employed as an Instructor of English at
the Department of ELT and the other as an English Language Specialist. Both the texts and the
questions in the test were proofread, and the texts were also rated from 1 to 10 according to
their difficulty. These two native speakers’ recommendations on the language of the texts
and questions were taken into consideration. Besides, the mean values of the two native
speakers’ text difficulty scores, which are shown in Table 2, gave an overall idea of the
difficulty of the texts.

Table 2

Text Difficulty Evaluation of Native Speakers

                                    Text Difficulty
Reading Test        Native Speaker 1   Native Speaker 2   Mean
Part 1   Text 1            8                  8            8
         Text 2            9                  8            8.5
         Text 3            7                  5            6
         Text 4            6                  5            5.5
         Mean              7.5                6.5          7
Part 2                     8                  6            7
Part 3                    10                  8            9
Part 4                     7                  7            7
Mean                       8.13               6.88         7.5


The native speakers’ evaluation of the texts indicates that the language of the texts
shows a difficulty level ranging from 5 to 10 on a 10-point difficulty scale. The two native
speakers’ evaluations of the texts show a high and significant correlation (r = .782; p < .05).
Although there are some slight differences between the difficulty levels of the texts in
different sections of the test, this does not affect its validity since each section functions
independently in the test. To conclude, an overall score of 7.5 on a 10-point scale may
indicate that the test is appropriate for use at proficiency level.
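
As a rough check on the inter-rater agreement reported above, the Pearson correlation can be
recomputed from the seven pairs of ratings in Table 2 (Texts 1 to 4 of Part 1, followed by
Parts 2 to 4). The sketch below is illustrative only: the function name `pearson_r` is
hypothetical, and the paper does not state which software was used for the original
calculation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Ratings from Table 2: Text 1, Text 2, Text 3, Text 4, Part 2, Part 3, Part 4.
native_speaker_1 = [8, 9, 7, 6, 8, 10, 7]
native_speaker_2 = [8, 8, 5, 5, 6, 8, 7]

r = pearson_r(native_speaker_1, native_speaker_2)
print(f"r = {r:.3f}")  # prints "r = 0.782"
```

With these seven pairs the result agrees with the reported value of r = .782.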

Moreover, readability analyses were administered for each text in the reading test.
Counts, averages, and the standard Flesch reading ease and Flesch-Kincaid grade level scores
were calculated using Microsoft® Word. The Fog scale level was calculated online at
http://www.readabilityformulas.com/free-readability-formula-assessment.php and the SMOG
(Simple Measure of Gobbledygook) readability formula was calculated online at
http://www.harrymclaughlin.com/SMOG.htm.

Table 3 presents the readability scores of the texts along with the details on counts and
averages, and it indicates that the texts in the reading test consist of a total of 4,068 words
in four parts. The readability analyses are presented as the results of four standard tests,
namely Flesch reading ease, Flesch-Kincaid grade level, Fog scale level, and the SMOG
readability formula. Firstly, Flesch reading ease scores, which measure readability by using
the average sentence length and the average number of syllables per word, indicate
similarities among the texts in the test. As higher scores indicate easier texts and scores
between 30 and 49 are considered difficult on the Flesch reading ease scale (McLaughlin,
1969), all the texts are classified as difficult with reference to their Flesch reading ease
scores. However, Flesch reading ease scores are considered most reliable for upper
elementary and secondary reading materials.

Secondly, Flesch-Kincaid grade level indicates the grade level of a text by measuring
textual difficulty, with scores above 12 reported as 12. Table 3 shows that all the texts in the
reading test appear at the level of 12 or above. It is worth mentioning that Flesch-Kincaid
grade level stands for a grade-school level. Therefore, like Flesch reading ease scores,
Flesch-Kincaid grade level scores are also considered reliable for upper elementary and
secondary reading materials.
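
The two Flesch measures described above can be written out explicitly. The following is a
minimal sketch using the standard published formulas; it is not the procedure used by
Microsoft® Word, whose internals are not described in this study. The function names and
the optional `cap` parameter are hypothetical; `cap` merely mirrors the ceiling of 12
mentioned above. Word, sentence, and syllable counts are taken as inputs, because counting
syllables automatically requires a heuristic or a pronunciation dictionary. The counts in
the example are invented for illustration.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch reading ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables, cap=None):
    """Flesch-Kincaid grade level, optionally limited to a ceiling (assumption)."""
    grade = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    return min(grade, cap) if cap is not None else grade


# Invented counts, not taken from the reading test.
ease = flesch_reading_ease(words=100, sentences=5, syllables=150)
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
capped = flesch_kincaid_grade(words=100, sentences=4, syllables=190, cap=12.0)

print(round(ease, 3), round(grade, 2), capped)
```

Both formulas depend only on average sentence length and average syllables per word, which
is why they are insensitive to how common or rare the words themselves are.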

Table 3

Scores of Readability Analyses

                                                     Reading Test
                                ------------- Part 1 -------------
Readability Analyses            Text 1   Text 2   Text 3   Text 4   Part 1   Part 2   Part 3   Part 4   Total/Mean

Counts
  Words                            247      265      279      215     1006     1109      708     1245      4068
  Characters                      1188     1470     1451     1152     5261     5827     3652     6198     20938
  Paragraphs                         5        4        4        3       16       18        8       10        52
  Sentences                         10       13       10       10       43       54       30       58       185
Averages
  Sentences per paragraph          2.5      4.3      3.3      5.0     3.78      3.6      4.2      6.4      4.49
  Words per sentence              24.2     20.2     27.6     20.7    23.18     20.1     23.2     21.4     21.97
  Characters per word              4.7      5.3      5.0      5.1     5.03      5.1      5.0      4.8      4.98
Readability
  Passive sentences                20%      30%      50%       0%      25%      20%       6%      15%     16.5%
  Flesch reading ease             49.0     30.1     38.7     37.4     38.8     36.2     42.4     40.7     39.53
  Flesch-Kincaid grade level      12.0     12.0     12.0     12.0     12.0     12.0     12.0     12.0      12.0
  Fog scale level                14.10    16.94    12.63     9.11     13.2    13.84    15.20    12.41     13.66
  SMOG readability formula       14.49    15.53    14.75    15.85    15.16    15.14    15.77    15.14     15.30


Although the scores of the two readability analyses of Flesch reading ease and
Flesch-Kincaid grade level provide a general idea about the texts, they cannot be considered
appropriate at proficient level. Therefore, subsequent analyses are required, such as the third
analysis, Fog scale level, which is mainly used to measure the readability of non-educational
texts. Similar to the Flesch scale, the Fog scale takes syllables and sentence lengths into
account, and words with three or more syllables are considered ‘foggy’. The Fog scale level
scores indicate that the texts are hard, and nearly difficult, to understand, which makes the
test an appropriate instrument for EFL learners at proficient level.

Moreover, a fourth readability analysis, the SMOG readability formula, was
administered to predict the difficulty level of the texts. Like the Fog scale, the SMOG
formula treats words of three or more syllables as polysyllabic words which make the text
difficult to read. The average SMOG level of the texts indicates that the reading test is at a
level between college and university degree with reference to the scale provided by
McLaughlin (1969). This score also makes the reading test an appropriate instrument for
testing reading comprehension at proficient level.
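
For comparison, the Fog and SMOG measures can be sketched in the same way. The formulas
below are the commonly published versions of the Gunning Fog index and the SMOG grade; which
exact variants the online calculators cited above implemented is not stated in this study,
so the sketch should not be expected to reproduce the values in Table 3. The function names
are hypothetical, the counts of words with three or more syllables are taken as inputs, and
the example counts are invented.

```python
from math import sqrt


def fog_index(words, sentences, complex_words):
    """Gunning Fog index; complex_words = words of three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog_grade(polysyllables, sentences):
    """SMOG grade; polysyllable count is normalised to a 30-sentence sample."""
    return 1.0430 * sqrt(polysyllables * (30 / sentences)) + 3.1291


# Invented counts, not taken from the reading test.
fog = fog_index(words=100, sentences=5, complex_words=10)
smog = smog_grade(polysyllables=30, sentences=30)

print(round(fog, 2), round(smog, 2))
```

Both indices rise with the proportion of long words, which is consistent with their being
more informative than the capped Flesch-Kincaid score for texts at proficient level.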

The scores of the readability analyses gave a clear picture of the texts’ difficulty levels by
examining them with reference to linguistic features. However, the nature of such readability
analyses does not allow the contextual investigation of lexical items in the text, a factor
which unavoidably plays a crucial role in reading comprehension. Therefore, the lexical items
in the reading test were also evaluated.

To enable this evaluation, all the vocabulary in the texts of the reading test was listed,
except for numbers and proper nouns. Repeated occurrences of a word were not taken into
consideration. The words in the list were then ranked according to their frequency of usage
with the help of the computer programme WordCount™, which presents the 86,800 most
frequently used English words in order of commonness, based on data from the British
National Corpus®. Words which do not appear in WordCount™ were assigned rank 86,801.
Table 4 presents the mean frequency values of the words in the reading test.
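
The procedure described above can be sketched as follows. This is an assumption-laden
illustration rather than the tool actually used: the function name, the `rank_of` lookup
table, and the explicit `proper_nouns` set are all hypothetical, and the three ranks in the
toy lookup are invented placeholders rather than WordCount™ data. Only the fallback rank of
86,801 is taken from the study.

```python
FALLBACK_RANK = 86_801  # rank given to words missing from the list, as in the study


def mean_frequency_rank(tokens, rank_of, proper_nouns=frozenset()):
    """Mean frequency rank over the distinct words of a text.

    Numbers and the listed proper nouns are skipped; repeated words count once.
    """
    distinct = set()
    for token in tokens:
        if token in proper_nouns:
            continue
        word = token.strip(".,;:!?\"'()").lower()
        if not word or any(ch.isdigit() for ch in word):
            continue
        distinct.add(word)
    if not distinct:
        raise ValueError("no countable words in text")
    ranks = [rank_of.get(word, FALLBACK_RANK) for word in distinct]
    return sum(ranks) / len(ranks)


# Toy lookup with invented ranks; a real run would load the full frequency list.
toy_ranks = {"the": 1, "of": 2, "cat": 1500}
tokens = ["The", "cat", "the", "zyzzyva", "2008", "Izmir"]

mean_rank = mean_frequency_rank(tokens, toy_ranks, proper_nouns={"Izmir"})
print(mean_rank)  # (1 + 1500 + 86801) / 3 = 29434.0
```

Because unlisted words receive the fallback rank, a handful of rare words can raise the mean
considerably, which should be kept in mind when reading the values in Table 4.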

Table 4

Mean Value of Frequency of Words in the Reading Test

Reading Test        Frequency of Words
Part 1   Text 1          3009.24
         Text 2          3438.70
         Text 3          2261.30
         Text 4          2517.53
         Mean            2806.70
Part 2                   6740.02
Part 3                   3399.97
Part 4                   3987.75
Mean                     4233.61


Table 4 above reveals that the words in the reading test have a mean frequency rank of
about 4,234. This average implies that the texts include less frequently used words along
with very common ones. Moreover, the frequencies of the words in the test show high and
significant correlations between Part 1 and Part 2 (r = .503; p < .01); Part 1 and Part 3
(r = .545; p < .01); Part 1 and Part 4 (r = .840; p < .01); Part 2 and Part 3 (r = .625;
p < .01); Part 2 and Part 4 (r = .824; p < .01); and Part 3 and Part 4 (r = .439; p < .01).

Table 5 displays the evaluation scores of the reading test for its validity in terms of
the native speakers’ difficulty ratings, readability scores, and word frequency analyses.

Table 5

Reading Test Validity Evaluation

                       Native speaker                  Readability
Reading Test           1       2       Mean    Flesch   Flesch-Kincaid   Fog      SMOG     Word frequency
Part 1   Text 1        8       8       8       49.0     12.0             14.10    14.49    3009.24
         Text 2        9       8       8.5     30.1     12.0             16.94    15.53    3438.70
         Text 3        7       5       6       38.7     12.0             12.63    14.75    2261.30
         Text 4        6       5       5.5     37.4     12.0              9.11    15.85    2517.53
         Mean          7.5     6.5     7       38.8     12.0             13.20    15.16    2806.70
Part 2                 8       6       7       36.2     12.0             13.84    15.14    6740.02
Part 3                10       8       9       42.4     12.0             15.20    15.77    3399.97
Part 4                 7       7       7       40.7     12.0             12.41    15.14    3987.75
Mean                   8.13    6.88    7.5     39.53    12.0             13.66    15.30    4233.61


To conclude with reference to Table 5, the four parts of the reading test show similarities
in terms of the native speakers’ difficulty ratings, readability analyses, and word frequency
levels. The scores indicate that the test is an appropriate material to be used with proficient
readers of EFL; therefore it can be considered valid.

Reliability of the reading test

To test the reliability of the reading test, item analysis in terms of item difficulty and
item discrimination was applied to the 32-item reading test, which was administered to a
group of 100 participants in the department of ELT.

To carry out the item analysis, first the participants’ answers were marked by the
researcher. The marking process was completely objective since it was done by computer. To
enable this, the researcher set up formulas in an Excel spreadsheet into which the data were
entered. Correct answers were given ‘1’ point whereas wrong ones were given ‘0’ points. As
all the items were totally objective in terms of marking, there was no need for an interrater
reliability score. Then the participants’ total scores were listed in descending order. The
answers of the 27 participants at the top of the list and the 27 participants at the bottom of
the list were taken into consideration in the next step. Later, the number of correct answers
to each item in the reading test was counted in the top 27-participant group and in the
bottom 27-participant group.

To calculate item difficulty the number of correct answers in the top 27-participant 
group was added to the number of correct answers in the bottom 27-participant group. The 
sum was divided by 54 and indicated the item difficulty score for each item in the reading 
test. 

On the other hand, to calculate item discrimination, the number of correct answers in 
the bottom 27-participant group was subtracted from the number of correct answers in the top 
27-participant group. The amount was then divided by 27 and indicated ‘item discrimination’. 
Table 6 shows the rationale used for the evaluation of the items in the reading test. 

Table 6

Rationale for the Item Analysis Process

Group   (p) Item Difficulty   (r) Item Discrimination   Interpretation
1       >0.90                 No value                  Preferable if teaching process is effective
2       0.60-0.90             >0.20                     Practically appropriate item
3       0.60-0.90             <0.20                     Needs to be revised
4       <0.60                 >0.20                     A formidable but discriminative item: Appropriate for high standards
5       <0.60                 <0.20                     A formidable but non-discriminative item: Needs to be removed
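
The item difficulty and item discrimination calculations described above, together with the
grouping in Table 6, can be sketched as follows. The function names are hypothetical. Table 6
does not say how an item with r exactly equal to 0.20 should be treated, so counting it as
discriminative is an assumption. The top-group and bottom-group counts in the example are
not reported in this study; they are back-calculated from the p and r values published for
items 1, 25, and 29.

```python
def item_difficulty(top_correct, bottom_correct, group_size=27):
    """(p): share of correct answers across the top and bottom groups."""
    return (top_correct + bottom_correct) / (2 * group_size)


def item_discrimination(top_correct, bottom_correct, group_size=27):
    """(r): difference in correct answers between top and bottom groups."""
    return (top_correct - bottom_correct) / group_size


def classify_item(p, r):
    """Return the group number (1 to 5) following the rationale in Table 6."""
    if p > 0.90:
        return 1
    discriminative = r >= 0.20  # assumption: r == 0.20 counts as discriminative
    if p >= 0.60:
        return 2 if discriminative else 3
    return 4 if discriminative else 5


# (top correct, bottom correct), inferred from the published p and r values.
inferred_counts = {"Item 1": (27, 16), "Item 25": (27, 27), "Item 29": (14, 11)}

for name, (top, bottom) in inferred_counts.items():
    p = item_difficulty(top, bottom)
    r = item_discrimination(top, bottom)
    print(f"{name}: p = {p:.6f}, r = {r:.6f}, group {classify_item(p, r)}")
```

Dividing by 54 for difficulty and by 27 for discrimination, as in the text, corresponds to
`2 * group_size` and `group_size` here, so the sketch carries over to other group sizes.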


The 32 items in the reading test were evaluated with reference to the rationale
presented in Table 6. Table 7 presents the results of the item analysis in terms of ‘item
difficulty’ and ‘item discrimination’, and indicates that all the items in the reading test
except for items 25 and 29 were appropriate to be used in the test. Therefore, these two
items were removed from the reading test. The answers of the participants on the remaining
30 items were then analyzed to find out the reliability of the reading test. Reliability analysis
revealed a Cronbach’s alpha score of α = .81 over the 30 items in the reading test. This score
indicates that the 30-question reading test is acceptably reliable.
Table 7 

Item Analysis of the Reading Test 

Items     (p) Item Difficulty   (r) Item Discrimination   Group 
Item 1    0.796296              0.407407                  2 
Item 2    0.870370              0.259259                  2 
Item 3    0.796296              0.407407                  2 
Item 4    0.851852              0.296296                  2 
Item 5    0.777778              0.444444                  2 
Item 6    0.740741              0.444444                  2 
Item 7    0.611111              0.703704                  2 
Item 8    0.796296              0.407407                  2 
Item 9    0.629630              0.592593                  2 
Item 10   0.648148              0.333333                  2 
Item 11   0.611111              0.629630                  2 
Item 12   0.611111              0.259259                  2 
Item 13   0.611111              0.407407                  2 
Item 14   0.611111              0.333333                  2 
Item 15   0.814815              0.370370                  2 
Item 16   0.648148              0.555556                  2 
Item 17   0.777778              0.296296                  2 
Item 18   0.611111              0.481481                  2 
Item 19   0.759259              0.407407                  2 
Item 20   0.722222              0.333333                  2 
Item 21   0.629630              0.592593                  2 
Item 22   0.611111              0.259259                  2 
Item 23   0.685185              0.629630                  2 
Item 24   0.722222              0.555556                  2 
Item 25   1                     0                         1 
Item 26   0.611111              0.481481                  2 
Item 27   0.740741              0.296296                  2 
Item 28   0.759259              0.259259                  2 
Item 29   0.462963              0.111111                  5 
Item 30   0.740741              0.444444                  2 
Item 31   0.740741              0.518519                  2 
Item 32   0.648148              0.555556                  2 
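
The Cronbach’s alpha reported for the 30 retained items is a standard internal-consistency 
coefficient, and its calculation can be sketched as below. This is a generic illustration of 
the usual formula, not the software procedure used in the study; the function name and the 
tiny response matrix are hypothetical, since the participants’ raw answers are not reproduced 
in this paper. 

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per participant)."""
    k = len(responses[0])  # number of items
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two participants")
    # Variance of each item across participants.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Variance of the participants' total scores.
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical answers of 4 participants to 3 items (1 = correct, 0 = incorrect).
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(cronbach_alpha(toy), 2))  # 0.75
```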


Conclusion 

This paper includes information about establishing the reliability and validity of a 
reading test, as well as a description of the development procedure of the test. After such 
detailed validity and reliability analyses, it might be possible to report on a reading test’s 
restrictions, such as the readability of the texts, what grades the test is appropriate for, and 
how discriminative the questions in the test are. 

The study aimed at describing the process of establishing validity and reliability of a 
reading test in detail with the intention of providing valuable information about multiple 
assessment criteria both to teachers of reading who rely on reading tests to determine the reading 
skills of their students and to researchers who are in need of reliable reading assessment tools for 
their pre and post tests. Conducting such validity and reliability analyses might also be 
beneficial for testers as they depend on assessment tools for making decisions about the 
candidates. 
In order to offer any opinions about the quality of a reading test, some assessment 
criteria need to be applied. Assessing any reading test with just a single criterion 
may not yield realistic results. Therefore, evaluating reading tests in terms of multiple 
factors may assist teachers, researchers, and testers to decide for themselves which reading 
test is most appropriate for their particular needs. 

The general tendency in assessing a reading test is to deal with its validity and 
reliability. Such an assessment requires reading tests which are free of bias and distortion. 
However, such analyses do not necessarily reveal the exact difficulty of the texts in the test, as 
reliability focuses on question items rather than the texts in the test. In addition to these two, 
calculating readability also gives an idea about the difficulty of a text. Nevertheless, 
readability analyses can also be considered superficial as they merely deal with either word or 
sentence lengths. Then, there arises the necessity of scrutinizing the words in the texts of a 
reading test. Therefore, vocabulary frequency analysis may assist testers to assess their texts 
more deeply. 

Implications 

Such detailed assessment of a reading test in terms of its validity and reliability is 
highly recommended for researchers who are in need of preparing pre and post tests for 
experimental studies. Then, they will be able to administer pre and post tests which are 
different from each other in content yet equivalent in difficulty. However, it might be very 
tiring for reading teachers to carry out such detailed analysis for their reading tests. 

Due to their profession, researchers might be aware of the importance of establishing 
validity and reliability for their reading tests; however, this may not be the case for teachers as 
their principal goal is teaching rather than researching. Nevertheless, teachers should also be 
encouraged to use valid and reliable tests to assess their students’ reading skills. It might be 
beneficial to assist reading teachers at any grade to achieve this goal with the help of in-service 
training. 

If in-service training on assessing the validity and reliability of reading tests cannot be 
provided to professionals, it might be beneficial to form databases which consist of 
valid and reliable reading tests. Having access to such databases will allow 
teachers, researchers, and also testers to select the most appropriate reading test in accordance 
with their needs. As cooperation with colleagues is one of the essential elements of 
establishing validity of a reading test, such collaboration among colleagues should be 
encouraged to establish more valid reading tests. 

Doubtless, the process of identifying vocabulary frequency in a reading test is both 
tedious and time-consuming. Therefore, computer programmers can be encouraged to add a 
feature to their word processors to calculate the vocabulary frequency of a reading test, which is 
very similar in principle to calculating the readability of a text in Microsoft Word®. Then, the 
ease of obtaining the vocabulary frequency level may encourage reading teachers to 
assess their texts in terms of vocabulary frequency as well. 
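
As a rough indication of what such a feature might involve, the sketch below computes the 
proportion of the running words of a text that fall into each of several frequency bands. It is 
only an illustration under stated assumptions: the function name is hypothetical, the two 
band lists are toy examples rather than a real frequency list, and no lemmatisation is 
attempted. 

```python
import re


def frequency_profile(text, bands):
    """Return the share of tokens per frequency band, plus an 'off-list' share.

    bands: dict mapping a band name to a set of lowercase words; a token is
    assigned to the first band (in dict order) that contains it.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {name: 0 for name in bands}
    counts["off-list"] = 0
    for token in tokens:
        for name, words in bands.items():
            if token in words:
                counts[name] += 1
                break
        else:
            counts["off-list"] += 1  # token found in no band
    total = len(tokens)
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


# Toy band lists (hypothetical); a real tool would load a published frequency list.
bands = {"band 1": {"the", "cat", "sat", "on"}, "band 2": {"mat"}}
profile = frequency_profile("The cat sat on the ergonomic mat.", bands)
print({name: round(share, 2) for name, share in profile.items()})
```

A text in which a large share of the tokens is off-list or belongs to low-frequency bands 
would then be flagged as lexically demanding, independently of its word and sentence lengths. 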

Moreover, in order to evaluate vocabulary frequency scores consistently, there is a 
need to develop sample criteria. Then, further research may calculate the vocabulary 
frequency of a wide variety of texts and correlate the results with different levels of 
language learning. 